Penalized regression models are widely used in high-dimensional data analysis
to conduct variable selection and model fitting simultaneously. Although their success has
been widely reported in the literature, their performance depends largely on the tuning
parameters that balance the trade-off between model fitting and model sparsity. Existing
tuning criteria mainly follow the route of minimizing the estimated prediction
error or maximizing the posterior model probability, such as cross-validation, AIC and
BIC. This article introduces a general tuning parameter selection criterion based on a
novel concept of variable selection stability. The key idea is to select the tuning parameters
so that the resultant penalized regression model is stable in variable selection.
The asymptotic selection consistency is established for both fixed and diverging dimensions.
The effectiveness of the proposed criterion is also demonstrated in a variety of
simulated examples as well as an application to the prostate cancer data.
1 Introduction



The rapid advance of technology has led to an increasing demand for modern statistical
techniques to analyze data with complex structure, such as high-dimensional data. In
high-dimensional data analysis, it is generally believed that only a small number of variables
are truly informative while the others are redundant. An underfitted model excludes truly
informative variables and may lead to severe estimation bias in model fitting, whereas an
overfitted model includes redundant uninformative variables, increases the estimation
variance and hinders model interpretation. Therefore, identifying the truly informative
variables is regarded as the primary goal of high-dimensional data analysis as well as of its
many real applications such as health studies (Fan and Li, 2006).

Among variable selection methods, penalized regression models have been widely
used. They penalize the model fitting with various regularization terms to encourage
model sparsity, such as the lasso regression (Tibshirani, 1996), the smoothly clipped absolute
deviation (SCAD; Fan and Li, 2001), the adaptive lasso (Zou, 2006), and the truncated
$L_1$-norm regression (Shen et al., 2012). In the penalized regression models, tuning parameters
are employed to balance the trade-off between model fitting and model sparsity, which
largely affects the numerical performance and the asymptotic behavior of the penalized
regression models. For example, Zhao and Yu (2006) showed that, under the irrepresentable
condition, the lasso regression is selection consistent when the tuning parameter converges
to 0 at a rate slower than $O(n^{-1/2})$. Analogous results on the choice of tuning parameters
have also been established for the SCAD, the adaptive lasso, and the truncated $L_1$-norm
regression. Therefore, it is of crucial importance to select the appropriate tuning parameters
so that the performance of the penalized regression models can be optimized.

In the literature, many classical selection criteria have been applied to the penalized
regression models, including cross validation (Stone, 1974), generalized cross validation (Craven
and Wahba, 1979), Mallows' $C_p$ (Mallows, 1973), AIC (Akaike, 1974), and BIC (Schwarz, 1978).
For instance, under certain regularity conditions, Wang et al. (2007) and Wang et al. (2009)
established the selection consistency of BIC for the SCAD, and Zhang et al. (2010) also
showed the selection consistency of the generalized information criterion (GIC) for the SCAD.
Most of these criteria follow the route of minimizing the estimated prediction error or
maximizing the posterior model probability. To the best of our knowledge, few criteria have been
developed that focus directly on the selection of the informative variables.

This article proposes a general tuning parameter selection criterion based on a novel 
concept of variable selection stability. Similar stability measures have been studied in the 
context of clustering (Ben-Hur et al., 2002; Wang, 2010) and variable selection (Meinshausen
and Bühlmann, 2010). The key idea is that if multiple samples are available from the
same distribution, a good variable selection method should yield similar sets of informative 
variables that do not vary much from one sample to another. The similarity between two 
informative variable sets is measured by Cohen's kappa coefficient (Cohen, 1960), which 
adjusts the actual variable selection agreement relative to the possible agreement by chance. 
The effectiveness of the proposed selection criterion is demonstrated in a variety of simulated 
examples and real applications. More importantly, its asymptotic selection consistency is 
established, showing that the variable selection method with the selected tuning parameter 
would recover the truly informative variable set with probability tending to one. 

The rest of the article is organized as follows. Section 2 briefly reviews the penalized 
regression models. Section 3 presents the idea of variable selection stability as well as the 
proposed kappa selection criterion. Section 4 establishes the asymptotic selection consistency 
of the kappa selection criterion. Simulation studies are given in Section 5, followed by a real 
application in Section 6. Section 7 provides a direct extension of the proposed kappa selection 
criterion. A brief discussion is provided in Section 8, and the Appendix is devoted to the 
technical proofs. 
2 Penalized least squares regression



Consider the linear regression model

$$y = X\beta + \epsilon = \sum_{j=1}^{p} \beta_j x_{(j)} + \epsilon,$$

where $y = (y_1, \ldots, y_n)^T$, $X = (x_1, \ldots, x_n)^T = (x_{(1)}, \ldots, x_{(p)})$ with $x_i = (x_{i1}, \ldots, x_{ip})^T$ or
$x_{(j)} = (x_{1j}, \ldots, x_{nj})^T$, $\beta = (\beta_1, \ldots, \beta_p)^T$, $E(\epsilon) = 0$ and $\mathrm{cov}(\epsilon) = \Sigma$. When $p$ is large, it
is also assumed that only a small number of $\beta_j$'s are nonzero, corresponding to the truly
informative variables. In addition, both $y$ and the $x_{(j)}$'s are centered, so the intercept can be
omitted in the regression model.

The general framework of the penalized regression models can be formulated as

$$\mathop{\mathrm{argmin}}_{\beta} \ \frac{1}{n}\|y - X\beta\|^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (1)$$

where $\|\cdot\|$ is the Euclidean norm, and $p_\lambda(|\beta_j|)$ is a regularization term encouraging sparsity
in $\beta$. Widely used regularization terms include the lasso penalty $p_\lambda(\theta) = \lambda\theta$ (Tibshirani,
1996), the SCAD penalty with $p'_\lambda(\theta) = \lambda\big(I(\theta \le \lambda) + \frac{(\gamma\lambda - \theta)_+}{(\gamma - 1)\lambda} I(\theta > \lambda)\big)$ (Fan and Li, 2001),
the adaptive lasso penalty $p_\lambda(\theta) = \lambda_j\theta = \lambda\theta/|\tilde\beta_j|$ (Zou, 2006) with $\tilde\beta_j$ being some initial
estimate of $\beta_j$, and the truncated $L_1$-norm penalty $p_\lambda(\theta) = \lambda\min(1, \theta)$ (Shen et al., 2012).

With appropriately chosen $\lambda_n$, all the aforementioned regularization terms are shown to
be selection consistent. Here a penalty term is said to be selection consistent if the probability
that the fitted regression model includes only the truly informative variables tends to
one, and $\lambda$ is replaced by $\lambda_n$ to emphasize its dependence on $n$ in quantifying the asymptotic
behavior. In particular, Zhao and Yu (2006) showed that the lasso regression is selection
consistent under the irrepresentable condition when $\sqrt{n}\lambda_n \to \infty$ and $\lambda_n \to 0$; Fan and Li
(2001) showed that the SCAD penalty is selection consistent when $\sqrt{n}\lambda_n \to \infty$ and $\lambda_n \to 0$;
Zou (2006) showed that the adaptive lasso penalty is selection consistent when $n\lambda_n \to \infty$
and $\sqrt{n}\lambda_n \to 0$; and Shen et al. (2012) showed that the truncated $L_1$-norm penalty is also
selection consistent when $\lambda_n$ satisfies a relatively more complex constraint.

Although the asymptotic order of $\lambda_n$ is known to assure the selection consistency of
the penalized regression models, it remains unclear how to select $\lambda_n$ appropriately in finite
samples so that the resultant model in (1) with the selected $\lambda_n$ can achieve superior numerical
performance and attain asymptotic selection consistency. Therefore, it is desirable to devise
a tuning parameter selection criterion that can be employed by the penalized regression
models so that their variable selection performance can be optimized.
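
As an illustration of the regularization terms listed in this section, the following minimal Python sketch evaluates each of them at a nonnegative argument `theta` (standing for $|\beta_j|$). The function names are hypothetical, the second SCAD parameter is called `gamma` here, and its default of 3.7 is the value suggested by Fan and Li (2001), which is an assumption not stated in this article.

```python
def lasso_penalty(theta, lam):
    """Lasso penalty: lam * theta, for theta = |beta_j| >= 0."""
    return lam * theta


def scad_derivative(theta, lam, gamma=3.7):
    """Derivative of the SCAD penalty at theta >= 0.

    Equals lam for theta <= lam and (gamma*lam - theta)_+ / (gamma - 1)
    for theta > lam. gamma = 3.7 is an assumed default.
    """
    if theta <= lam:
        return lam
    return max(gamma * lam - theta, 0.0) / (gamma - 1.0)


def adaptive_lasso_penalty(theta, lam, beta_init):
    """Adaptive lasso penalty: lam * theta / |beta_init|.

    beta_init is an initial (nonzero) estimate of the coefficient.
    """
    return lam * theta / abs(beta_init)


def truncated_l1_penalty(theta, lam):
    """Truncated L1-norm penalty: lam * min(1, theta)."""
    return lam * min(1.0, theta)
```

The SCAD is given through its derivative, as in the text; the sketch shows that the derivative equals the lasso rate `lam` near zero and vanishes for large coefficients, which is why large effects are not shrunk.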

3 Tuning via variable selection stability 

This section introduces the proposed tuning parameter selection criterion based on a novel 
concept of variable selection stability. The key idea is that if we repeatedly draw samples 
from the population and apply the candidate variable selection methods, a desirable method 
should produce the informative variable set that does not vary much from one sample to 
another. Clearly, variable selection stability is assumption free and can be used to tune any 
penalized regression model. 

3.1 Variable selection stability 

For simplicity, we denote the training sample as $z^n$. A base variable selection method
$\Psi(z^n; \lambda)$ with a given training sample $z^n$ and a tuning parameter $\lambda$ yields a set of selected
informative variables $\mathcal{A} \subset \{1, \ldots, p\}$, called the active set. When $\Psi$ is applied to various
training samples, different active sets can be produced. Suppose that two active sets $\mathcal{A}_1$
and $\mathcal{A}_2$ are produced. The agreement between $\mathcal{A}_1$ and $\mathcal{A}_2$ can be measured by Cohen's kappa
coefficient (Cohen, 1960),

$$\kappa(\mathcal{A}_1, \mathcal{A}_2) = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}. \qquad (2)$$

Here the relative observed agreement between $\mathcal{A}_1$ and $\mathcal{A}_2$ is $\Pr(a) = (n_{11} + n_{22})/p$, and the
hypothetical probability of chance agreement is $\Pr(e) = (n_{11} + n_{12})(n_{11} + n_{21})/p^2 + (n_{12} +
n_{22})(n_{21} + n_{22})/p^2$, with $n_{11} = |\mathcal{A}_1 \cap \mathcal{A}_2|$, $n_{12} = |\mathcal{A}_1 \cap \mathcal{A}_2^c|$, $n_{21} = |\mathcal{A}_1^c \cap \mathcal{A}_2|$, $n_{22} = |\mathcal{A}_1^c \cap \mathcal{A}_2^c|$,
and $|\cdot|$ being the cardinality of a set. Note that $-1 \le \kappa(\mathcal{A}_1, \mathcal{A}_2) \le 1$, where $\kappa(\mathcal{A}_1, \mathcal{A}_2) = 1$
when $\mathcal{A}_1$ and $\mathcal{A}_2$ are in complete agreement with $n_{12} = n_{21} = 0$, and $\kappa(\mathcal{A}_1, \mathcal{A}_2) = -1$ when
$\mathcal{A}_1$ and $\mathcal{A}_2$ are in complete disagreement with $n_{11} = n_{22} = 0$ and $n_{12} = n_{21} = p/2$. Based on
(2), the variable selection stability is defined as follows.

Definition 1 The variable selection stability of $\Psi(\cdot; \lambda)$ is defined as

$$s(\Psi, \lambda, n) = E\big(\kappa(\Psi(Z_1^n; \lambda), \Psi(Z_2^n; \lambda))\big), \qquad (3)$$

where the expectation is taken with respect to $Z_1^n$ and $Z_2^n$, two independent and identically
distributed training samples of size $n$, and $\Psi(Z_1^n; \lambda)$ and $\Psi(Z_2^n; \lambda)$ are two active sets obtained by applying
$\Psi(\cdot; \lambda)$ to $Z_1^n$ and $Z_2^n$, respectively.

By definition, $-1 \le s(\Psi, \lambda, n) \le 1$, and a large value of $s(\Psi, \lambda, n)$ indicates a stable variable
selection method $\Psi(\cdot; \lambda)$. Note that the definition of $s(\Psi, \lambda, n)$ relies on the unknown
population distribution; therefore it needs to be estimated based on the only available training
sample in practice.
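
To make the kappa coefficient between two active sets concrete, here is a minimal Python sketch. The function name `selection_kappa` and the 0-based variable indexing are illustrative assumptions; returning -1 when the chance agreement equals one (both sets empty or both full) is also an assumption made to avoid a division by zero.

```python
def selection_kappa(a1, a2, p):
    """Cohen's kappa between two active sets drawn from {0, ..., p-1}."""
    a1, a2 = set(a1), set(a2)
    n11 = len(a1 & a2)            # selected by both
    n12 = len(a1 - a2)            # selected by the first only
    n21 = len(a2 - a1)            # selected by the second only
    n22 = p - n11 - n12 - n21     # selected by neither
    chance = (n11 + n12) * (n11 + n21) + (n12 + n22) * (n21 + n22)
    if chance == p * p:
        # Pr(e) = 1: both sets empty or both full (assumed convention).
        return -1.0
    pr_a = (n11 + n22) / p
    pr_e = chance / p ** 2
    return (pr_a - pr_e) / (1.0 - pr_e)
```

For example, with `p = 4`, identical sets `{0, 1}` give kappa 1, disjoint halves `{0, 1}` and `{2, 3}` give kappa -1, and `{0, 1}` against `{0, 2}` gives kappa 0, that is, agreement no better than chance.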

3.2 Kappa selection criterion 

This section proposes an estimation scheme of the variable selection stability based on cross
validation, and develops a kappa selection criterion to tune the penalized regression models
by maximizing the estimated variable selection stability. Specifically, the training sample $z^n$
is randomly partitioned into two subsets $z_1^m$ and $z_2^m$ with $m = \lfloor n/2 \rfloor$ for simplicity. The base
variable selection method $\Psi(\cdot; \lambda)$ is applied to the two subsets separately, two active
sets $\hat{\mathcal{A}}_{1\lambda}$ and $\hat{\mathcal{A}}_{2\lambda}$ are obtained, and $s(\Psi, \lambda, m)$ is estimated as $\kappa(\hat{\mathcal{A}}_{1\lambda}, \hat{\mathcal{A}}_{2\lambda})$. Furthermore,
in order to reduce the estimation variability due to the splitting randomness, multiple data
splittings can be conducted and the averaged estimated variable selection stability over all
splittings is computed. The selected $\lambda$ is then the one maximizing the averaged estimated
variable selection stability. The details of the proposed kappa selection criterion are given
as follows.

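Algorithm 1 (kappa selection criterion):

Step 1. Randomly partition $(x_1, \ldots, x_n)^T$ into two subsets $z_1^{*b} = (x_1^{*b}, \ldots, x_m^{*b})^T$ and
$z_2^{*b} = (x_{m+1}^{*b}, \ldots, x_{2m}^{*b})^T$.

Step 2. Obtain $\hat{\mathcal{A}}_{1\lambda}^{*b}$ and $\hat{\mathcal{A}}_{2\lambda}^{*b}$ from $\Psi(z_1^{*b}, \lambda)$ and $\Psi(z_2^{*b}, \lambda)$ respectively, and the variable
selection stability of $\Psi(\cdot; \lambda)$ in the $b$-th splitting is estimated as

$$\hat{s}^{*b}(\Psi, \lambda, m) = \kappa(\hat{\mathcal{A}}_{1\lambda}^{*b}, \hat{\mathcal{A}}_{2\lambda}^{*b}).$$

Step 3. Repeat Steps 1-2 $B$ times. The averaged estimated variable selection stability
of $\Psi(\cdot; \lambda)$ is then

$$\hat{s}(\Psi, \lambda, m) = B^{-1} \sum_{b=1}^{B} \hat{s}^{*b}(\Psi, \lambda, m).$$

Step 4. Compute $\hat{s}(\Psi, \lambda_t, m)$ for a sequence of $\lambda_t$'s, and set $\hat\lambda = \min_{\lambda_t}\{\lambda_t : \lambda_t \in \Lambda_n\}$ with

$$\Lambda_n = \Big\{\lambda_t : \frac{\hat{s}(\Psi, \lambda_t, m)}{\max_{\lambda_t} \hat{s}(\Psi, \lambda_t, m)} \ge 1 - \alpha_n\Big\}.$$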

Note that the treatment in Step 4 is necessary since some informative variables may have
relatively weak effects compared with others. A large value of $\lambda$ may produce an active set
that consistently overlooks the weakly informative variables, which leads to an underfitted
model with large variable selection stability. To assure the asymptotic selection consistency,
the thresholding value $\alpha_n$ in Step 4 needs to be small and converge to 0 as $n$ grows. Setting
$\alpha_n = 0.1$ in the numerical experiments yields satisfactory performance based on our limited
experience. Furthermore, the sensitivity study in Section 5 suggests that $\alpha_n$ has very little
effect on the selection performance when it varies in a certain range.

In Steps 1-3, the estimation scheme based on cross-validation can be replaced by other
data re-sampling strategies such as bootstrap or random weighting, which do not reduce the
sample size in estimating $\hat{\mathcal{A}}_{1\lambda}^{*b}$ and $\hat{\mathcal{A}}_{2\lambda}^{*b}$, but the independence between $\hat{\mathcal{A}}_{1\lambda}^{*b}$ and $\hat{\mathcal{A}}_{2\lambda}^{*b}$ will
no longer hold. Furthermore, since the true model is assumed to be sparse and to contain
at least some informative variables, any $\lambda$ leading to an active set with all variables or no
variable is excluded from the comparison by setting the corresponding variable selection
stability to $-1$.
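
The four steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `kappa_select` and `marginal_screen` are hypothetical names, the base selector is passed in as a function `select(subsample, lam)` returning a set of 0-based indices, `marginal_screen` is only a toy stand-in for a penalized regression fit, and the fallback used when no $\lambda$ attains positive stability is an assumption.

```python
import random


def kappa(a1, a2, p):
    """Cohen's kappa between two active sets; -1 for empty or full sets."""
    a1, a2 = set(a1), set(a2)
    if len(a1) in (0, p) or len(a2) in (0, p):
        return -1.0  # exclusion rule for degenerate active sets
    n11, n12, n21 = len(a1 & a2), len(a1 - a2), len(a2 - a1)
    n22 = p - n11 - n12 - n21
    pr_a = (n11 + n22) / p
    pr_e = ((n11 + n12) * (n11 + n21) + (n12 + n22) * (n21 + n22)) / p ** 2
    return (pr_a - pr_e) / (1.0 - pr_e)


def kappa_select(sample, select, lambdas, p, B=20, alpha=0.1, seed=0):
    """Return (lambda_hat, averaged stabilities) following Steps 1-4."""
    rng = random.Random(seed)
    m = len(sample) // 2
    totals = [0.0] * len(lambdas)
    for _ in range(B):
        shuffled = list(sample)
        rng.shuffle(shuffled)                  # Step 1: random split
        z1, z2 = shuffled[:m], shuffled[m:2 * m]
        for t, lam in enumerate(lambdas):      # Step 2: kappa per lambda
            totals[t] += kappa(select(z1, lam), select(z2, lam), p)
    stability = [s / B for s in totals]        # Step 3: average over B splits
    best = max(stability)
    if best <= 0:
        # Assumption: without positive stability the ratio rule is not
        # meaningful, so fall back to the most stable lambda.
        return lambdas[stability.index(best)], stability
    # Step 4: smallest lambda whose stability ratio is at least 1 - alpha.
    admissible = [lam for lam, s in zip(lambdas, stability)
                  if s / best >= 1 - alpha]
    return min(admissible), stability


def marginal_screen(sample, lam):
    """Toy selector (hypothetical): keep j when |mean(x_j * y)| > lam."""
    n = len(sample)
    p = len(sample[0][0])
    return {j for j in range(p)
            if abs(sum(x[j] * y for x, y in sample) / n) > lam}
```

A call such as `kappa_select(sample, marginal_screen, lambdas, p=8)` expects `sample` to be a list of `(x, y)` pairs with `x` a list of length `p`. Passing the selector as an argument reflects the point made in Section 3 that the criterion can be paired with any base variable selection method.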

4 Asymptotic selection consistency 

This section presents the asymptotic selection consistency of the proposed kappa selection
criterion. Without loss of generality, we assume that only the first $p_0 < p$ variables are
informative, and denote the truly informative variable set as $\mathcal{A}_T = \{1, \ldots, p_0\}$ and the
uninformative variable set as $\mathcal{A}_T^c = \{p_0 + 1, \ldots, p\}$. Furthermore, we denote $r_n \prec s_n$ if $r_n$
converges to 0 at a faster rate than $s_n$, $r_n \sim s_n$ if $r_n$ converges to 0 at the same rate as $s_n$,
and $r_n \preceq s_n$ if $r_n$ converges to 0 at a rate not slower than $s_n$.

4.1 Consistency with fixed p 

To establish the asymptotic selection consistency with fixed $p$, the following technical
assumptions are made.

Assumption 1: There exist $r_n$ and $s_n$ such that the base variable selection method is
selection consistent if $r_n \prec \lambda_n \prec s_n$. That is,

$$P(\hat{\mathcal{A}}_{\lambda_n} = \mathcal{A}_T) \ge 1 - \epsilon_n, \quad \text{for some } \epsilon_n \to 0.$$

Assumption 2: For $r_n$ in Assumption 1, if $\lambda_n \preceq r_n$, the base variable selection method
is overfitted in that $P(\mathcal{A}_T \subseteq \hat{\mathcal{A}}_{\lambda_n}) \to 1$ and there exists a constant $c_0 > 0$ such that for
sufficiently large $n$,

$$P(\mathcal{A}_T \cup \{j\} \subseteq \hat{\mathcal{A}}_{\lambda_n}) \ge c_0, \quad \text{for any } j \in \mathcal{A}_T^c. \qquad (4)$$

In Assumption 1, $r_n$ and $s_n$ specify an asymptotic working interval for $\lambda_n$ so that the base
variable selection method is selection consistent. Assumption 2 is necessary since it implies
a natural order of the variable selection stability with respect to $\lambda_n$ and it excludes the
degenerate variable selection methods that always produce the same $\hat{\mathcal{A}}_{\lambda_n}$ regardless of the
training sample. The inequality (4) can be replaced by a slightly stronger assumption that
the distribution of $\{x_{(j)}, j \in \mathcal{A}_T^c\}$ is exchangeable and the base variable selection method is
no worse than random guessing (Meinshausen and Bühlmann, 2010).

Note that Assumptions 1 and 2 are mild in that they are satisfied by many popular
variable selection methods. For instance, Lemma 1 shows that Assumptions 1 and 2 are
satisfied by the lasso regression, the SCAD, and the adaptive lasso. The assumptions can also
be verified for other methods such as the elastic net (Zou and Hastie, 2005), adaptive elastic net
(Zou and Zhang, 2009), group lasso (Yuan and Lin, 2006), and adaptive group lasso (Wang
and Leng, 2008).

Lemma 1 Assumptions 1 and 2 are satisfied by the lasso regression and the SCAD with
$r_n = n^{-1/2}$ and $s_n = o(1)$ under the assumptions in Zhao and Yu (2006) or Fan and Li
(2001), and by the adaptive lasso with $r_n = n^{-1}$ and $s_n = n^{-1/2}$ under the assumptions in
Zou (2006).




Given that the base variable selection method is selection consistent with appropriately
selected $\lambda_n$'s, Theorem 1 shows that the proposed kappa selection criterion is able to identify
such $\lambda_n$'s.

Theorem 1 Under Assumptions 1 and 2, any variable selection method in (1) with $\hat\lambda_n$
selected as in Algorithm 1 with $\alpha_n \succ \epsilon_n$ is selection consistent. That is, as $n \to \infty$,

$$P(\hat{\mathcal{A}}_{\hat\lambda_n} = \mathcal{A}_T) \to 1.$$

Theorem 1 claims the asymptotic selection consistency of the proposed kappa selection
criterion when $p$ is fixed, which indicates that, with probability tending to one, the active
set selected by the resultant variable selection method with tuning parameter $\hat\lambda_n$ contains only
the truly informative variables. It is worth pointing out that as long as $\alpha_n$ converges to 0
not too fast, the kappa selection criterion is guaranteed to be consistent. Therefore, the value
of $\alpha_n$ is expected to have little effect on the performance of the kappa selection criterion,
which agrees with the sensitivity study in Section 5.

4.2 Consistency with diverging p n 

In high-dimensional data analysis, it is of interest to study the asymptotic behavior of the
proposed kappa selection criterion with diverging $p_n$, where $p_0$ may also diverge with $n$. To
accommodate the diverging $p_n$ scenario, the technical assumptions are modified as follows.

Assumption 1a: There exist $r_n$ and $s_n$ such that if $r_n \prec \lambda_n \prec s_n$ the base variable
selection method is selection consistent in that

$$P(\hat{\mathcal{A}}_{\lambda_n} = \mathcal{A}_T) \ge 1 - \epsilon_n,$$

where $\epsilon_n \prec p_n^{-1} c_0(p_n)$, and $c_0(p_n)$ is defined as in Assumption 2a.

Assumption 2a: For $r_n$ in Assumption 1a, if $\lambda_n \preceq r_n$, the base variable selection method
is overfitted in that $P(\mathcal{A}_T \subseteq \hat{\mathcal{A}}_{\lambda_n}) \to 1$ and for sufficiently large $n$,

$$P(\mathcal{A}_T \cup \{j\} \subseteq \hat{\mathcal{A}}_{\lambda_n}) \ge c_0(p_n) > 0, \quad \text{for any } j \in \mathcal{A}_T^c, \qquad (5)$$

where $c_0(p_n)$ is allowed to converge to 0 as $p_n$ diverges.

Compared with the previous assumptions in Section 4.1, Assumption 1a is slightly stronger
than Assumption 1 in that it requires the base variable selection method to be selection
consistent at a rate faster than $p_n^{-1} c_0(p_n)$, and Assumption 2a is weaker than Assumption 2 as
$c_0(p_n)$ is allowed to converge to 0.

Theorem 2 Under Assumptions 1a and 2a, any variable selection method in (1) with $\hat\lambda_n$ as
selected in Algorithm 1 with $\alpha_n \to 0$ and $\epsilon_n/\alpha_n \prec p_n^{-1} c_0(p_n)$ is selection consistent.

Theorem 2 shows the asymptotic selection consistency of the proposed kappa selection
criterion with diverging $p_n$, where the diverging speed of $p_n$ is bounded as in $p_n^{-1} c_0(p_n) \succ
\epsilon_n$ and depends on the base variable selection method. For example, the exchangeability
assumption in Meinshausen and Bühlmann (2010) implies Assumption 2a with $c_0(p_n) \ge p_n^{-1}$,
and thus $p_n^{-1} \succ \epsilon_n^{1/2}$ is sufficient for Assumption 1a. In addition, Zhao and Yu (2006) showed
that Assumption 1a is satisfied by the lasso regression with $r_n = n^{1/2} p_n^{1/(2k)}$, $s_n = n^{(1+g_2-g_1)/2}$
and $\epsilon_n = O(p_n n^{k} \lambda_n^{-2k})$, where the error term is assumed to have finite $2k$-th moment and
$p_n = o(n^{(g_2-g_1)k})$ with $0 \le g_1 < g_2 \le 1$. However, it is relatively difficult to verify Assumption
1a for other variable selection methods with diverging $p_n$, as their convergence rates $\epsilon_n$ are
not explicitly specified (Fan and Peng, 2004; Huang et al., 2008).
5 Simulations



This section examines the effectiveness of the proposed kappa selection criterion in simulated
examples. Its performance is compared against a number of popular competitors, including
Mallows' $C_p$ ($C_p$), BIC, 10-fold cross-validation (CV), and generalized cross validation
(GCV). Their formulations are given as follows,

$$C_p(\lambda) = \frac{\mathrm{SSE}}{n} + \frac{2\,\mathrm{df}}{n}\hat\sigma^2, \qquad (6)$$

$$\mathrm{BIC}(\lambda) = \frac{\mathrm{SSE}}{n\hat\sigma^2} + \frac{\log(n)\,\mathrm{df}}{n}, \qquad (7)$$

$$\mathrm{CV}(\lambda) = \sum_{s=1}^{10} \sum_{(y_k, x_k) \in T^{-s}} \big(y_k - x_k^T \hat\beta^{(s)}(\lambda)\big)^2, \qquad (8)$$

$$\mathrm{GCV}(\lambda) = \frac{\mathrm{SSE}}{n(1 - \mathrm{df}/n)^2}, \qquad (9)$$

where $\mathrm{SSE} = \|y - X\hat\beta(\lambda)\|^2$, $\hat\sigma^2$ is an estimate of $\sigma^2$ based on the saturated model, and df is
estimated as the number of nonzero variables in $\hat\beta(\lambda)$ (Zou et al., 2007). In (8), $T^{s}$ and $T^{-s}$
are the training and validation sets in CV, and $\hat\beta^{(s)}(\lambda)$ is the estimated $\beta$ using the training
set $T^{s}$ and tuning parameter $\lambda$. The optimal $\lambda$ is then selected as the one that minimizes
the corresponding $C_p(\lambda)$, $\mathrm{BIC}(\lambda)$, $\mathrm{CV}(\lambda)$, or $\mathrm{GCV}(\lambda)$, respectively.
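
The closed-form competitors can be computed from the residual sum of squares and the degrees of freedom alone, as in the minimal sketch below. The scalings $C_p = \mathrm{SSE}/n + 2\,\mathrm{df}\,\hat\sigma^2/n$ and $\mathrm{BIC} = \mathrm{SSE}/(n\hat\sigma^2) + \log(n)\,\mathrm{df}/n$ used here follow Zou et al. (2007) and are an assumption about the exact form intended; the function names are hypothetical.

```python
import math


def cp(sse, df, n, sigma2):
    """Mallows' C_p, assuming the scaling SSE/n + 2*df*sigma2/n."""
    return sse / n + 2.0 * df * sigma2 / n


def bic(sse, df, n, sigma2):
    """BIC, assuming the scaling SSE/(n*sigma2) + log(n)*df/n."""
    return sse / (n * sigma2) + math.log(n) * df / n


def gcv(sse, df, n):
    """Generalized cross validation: SSE / (n * (1 - df/n)^2)."""
    return sse / (n * (1.0 - df / n) ** 2)


def select_lambda(lambdas, scores):
    """Return the lambda with the smallest criterion value."""
    return min(zip(scores, lambdas))[1]
```

Each criterion is evaluated on the grid of candidate $\lambda$ values and `select_lambda` returns the minimizer, in contrast with the kappa selection criterion, which maximizes a stability measure.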

To assess the performance of each selection criterion, we report the percentage of
selecting the true model over all replicates, as well as the number of correctly selected
zeros and incorrectly selected zeros in $\hat\beta(\hat\lambda)$. The final estimate $\hat\beta(\hat\lambda)$ is obtained by
refitting the standard least squares regression based only on the selected informative
variables. We then compare the prediction performance through the relative prediction error
$\mathrm{RPE} = E(x^T\hat\beta(\hat\lambda) - x^T\beta)^2/\sigma^2$ (Zou, 2006).
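
The reported summaries can be computed as in the following minimal sketch. The function names are hypothetical, and the closed form used for the RPE, $(\hat\beta - \beta)^T \mathrm{Cov}(x)(\hat\beta - \beta)/\sigma^2$, assumes that $x$ has mean zero and known covariance and that the expectation is taken over a new $x$ only.

```python
def selection_summary(beta_hat, beta_true):
    """Count correctly (C) and incorrectly (I) selected zeros."""
    zeros_hat = {j for j, b in enumerate(beta_hat) if b == 0}
    zeros_true = {j for j, b in enumerate(beta_true) if b == 0}
    return {
        "C": len(zeros_hat & zeros_true),   # true zeros estimated as zero
        "I": len(zeros_hat - zeros_true),   # nonzero effects set to zero
        "true_model": zeros_hat == zeros_true,
    }


def relative_prediction_error(beta_hat, beta_true, cov, sigma2):
    """RPE for mean-zero x with covariance matrix cov (list of lists)."""
    d = [bh - bt for bh, bt in zip(beta_hat, beta_true)]
    quad = sum(d[i] * cov[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return quad / sigma2
```

For instance, an estimate that zeroes out one informative variable and keeps one uninformative variable gets `I = 1` and fails the `true_model` check even when most zeros are recovered.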
5.1 Scenario I: fixed p

The simulated datasets $(x_i, y_i)_{i=1}^{n}$ are generated from the model

$$y = x^T\beta + \epsilon = \sum_{j=1}^{8} x_{(j)}\beta_j + \epsilon,$$

where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$, $x_{(j)}$ and $\epsilon$ are generated from the standard normal distribution,
and the correlation between $x_{(i)}$ and $x_{(j)}$ is set as $0.5^{|i-j|}$. This example has been commonly
used in the literature, including Tibshirani (1996), Fan and Li (2001), and Wang et al. (2007).

For comparison, we set $n = 40$, 60 or 80 and implement the lasso regression, the adaptive
lasso and the SCAD as the base variable selection methods. The lasso regression and the
adaptive lasso are implemented by the package 'lars' (Efron et al., 2004) and the SCAD is
implemented by the package 'ncvreg' (Breheny and Huang, 2011) in R. The tuning parameters
$\lambda$ are selected via each selection criterion, optimized through a grid search over 100 grid
points $\{10^{-2+4l/99}; l = 0, \ldots, 99\}$. The number of splittings for the kappa selection criterion
is $B = 20$. Each simulation is replicated 100 times, and the percentage of selecting the true
active set, the averaged number of correctly selected zeros (C) and incorrectly selected zeros
(I), and the relative prediction error (RPE) are summarized in Tables 1-2 and Figure 1.
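
A minimal Python sketch of the Scenario I data-generating process and of the tuning grid is given below. The names `draw_sample`, `BETA` and `LAMBDA_GRID` are hypothetical, and producing the correlation $0.5^{|i-j|}$ through an AR(1) recursion is an assumption about the mechanism; the experiments themselves were run in R.

```python
import math
import random

BETA = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]


def draw_sample(n, beta=BETA, rho=0.5, seed=0):
    """Draw n pairs (x, y) with corr(x_i, x_j) = rho^|i-j| and N(0,1) noise."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    sample = []
    for _row in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _col in range(len(beta) - 1):
            # AR(1) step keeps unit variance and gives corr rho^|i-j|.
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        y = sum(b * v for b, v in zip(beta, x)) + rng.gauss(0.0, 1.0)
        sample.append((x, y))
    return sample


# 100 grid points 10^(-2 + 4l/99), l = 0, ..., 99, spanning 0.01 to 100.
LAMBDA_GRID = [10 ** (-2 + 4 * l / 99) for l in range(100)]
```

The grid is equally spaced on the log scale, so small and large penalty levels are explored with the same relative resolution.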

Tables 1-2 and Figure 1 about here

Evidently, the proposed kappa selection criterion delivers superior performance against
its competitors in terms of both variable selection accuracy and relative prediction error. As
shown in Table 1, the kappa selection criterion (Ks) has the largest probability of choosing
the true active set and consistently outperforms other selection criteria, especially when the
lasso regression is used as the base variable selection method. As the sample size $n$ increases,
the percentage of selecting the true active set also improves, which confirms the selection
consistency in Section 4.
Table 2 shows that the kappa selection criterion yields the largest number of correctly
selected zeros in all scenarios, and it yields almost perfect performance for the adaptive lasso
and the SCAD. In addition, all selection criteria barely select any incorrect zeros, whereas
the kappa selection criterion is relatively more aggressive in that it has a small chance of
shrinking some informative variables to zero for the lasso regression. All other criteria tend
to be conservative and include some uninformative variables, so their numbers of correctly
selected zeros are significantly less than 5.

Besides the superior variable selection performance, the kappa selection criterion also
delivers accurate prediction performance and yields small relative prediction error as displayed
in Figure 1. Note that other criteria, especially $C_p$ and GCV, produce large relative prediction
errors, which could be due to their conservative selection of the informative variables.

To illustrate the effectiveness of the kappa selection criterion, we randomly select one
replication with $n = 40$ and display the estimated variable selection stability as well as the
results of detection and sparsity for various $\lambda$'s for the lasso regression. The detection is
defined as the percentage of selecting the truly informative variables, and the sparsity is
defined as the percentage of excluding the truly uninformative variables. In Figure 2, it
is clear that there is a positive association between the variable selection stability and the
values of detection and sparsity. More importantly, the selection performance of the kappa
selection criterion is very stable against $\alpha_n$ when it is small. Specifically, we apply the kappa
selection criterion to the lasso regression for $\alpha_n \in \{l/100; l = 0, \ldots, 30\}$ and compute the
corresponding averaged RPE over 100 replications. As shown in the last panel of Figure 2,
the averaged RPEs are almost the same for $\alpha_n \in (0, 0.13)$, which confirms the theoretical
results in Section 4.

Figure 2 about here
5.2 Scenario II: diverging $p_n$

Next we compare all the selection criteria in the scenario with diverging $p_n$, using a similar
simulation model as in Scenario I, except that $\beta = (5, 4, 3, 2, 1, 0, \ldots, 0)^T$ and $p_n = \lfloor\sqrt{n}\rfloor$.
More specifically, four cases are examined: $n = 100$, $p_n = 10$; $n = 200$, $p_n = 14$; $n =
400$, $p_n = 20$; and $n = 800$, $p_n = 28$. A similar simulation example is also studied in
Tibshirani (1996). The percentage of selecting the true active set, the averaged number of
correctly selected zeros (C) and incorrectly selected zeros (I), and the relative prediction
error (RPE) are summarized in Tables 3-4 and Figure 3.

Tables 3-4 and Figure 3 about here

The proposed kappa selection criterion still outperforms other competitors in both
variable selection and prediction performance. As illustrated in Tables 3-4, the kappa selection
criterion delivers the largest percentage of selecting the true active set among all the selection
criteria, and achieves perfect variable selection performance for the adaptive lasso and the
SCAD, and for the lasso regression with $n \ge 400$. Furthermore, as shown in Figure 3, the
kappa selection criterion yields the smallest relative prediction error across all cases.

6 Real application 

In this section, we apply the kappa selection criterion to the prostate cancer data (Stamey et
al., 1989), which were used to study the relationship between the level of log(prostate specific
antigen) (lpsa) and a number of clinical measures. The dataset consists of 97 patients who
had received a radical prostatectomy, and the eight clinical measures are log(cancer volume)
(lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph),
seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason) and
percentage of Gleason scores 4 or 5 (pgg45).
The dataset is randomly split into two parts: a training set with 67 patients and a
test set with 30 patients. As in the simulated examples, the tuning parameters $\lambda$
are selected through a grid search over 100 grid points $\{10^{-2+4l/99}; l = 0, \ldots, 99\}$. Since it
is unknown whether the clinical measures are truly informative or not, the performance of
all the selection criteria is compared by computing their corresponding relative prediction
errors (RPE) on the test data in Table 5.

Table 5 about here

As shown in Table 5, the proposed kappa selection criterion yields the sparsest model and
achieves the smallest relative prediction errors for the lasso regression and the SCAD, while
the relative prediction error for the adaptive lasso is comparable to the minimum. Specifically,
the lasso regression and the SCAD with the kappa selection criterion include lcavol, lweight,
lbph and svi as the informative variables, and the adaptive lasso with the kappa selection
criterion selects only lcavol, lweight and svi as the informative variables. As opposed to the
sparse regression models produced by other selection criteria, the variable age is excluded
by the kappa selection criterion for all base variable selection methods, which agrees with
the findings in Zou and Hastie (2005).

7 Extended selection criterion 

In this section, we present a direct extension by combining the kappa selection criterion and 
the conventional cross-validation, which does not require the pre-specified thresholding value 
a n in Algorithm 1. 

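To compute the cross-validation error, for $Z_1^* = \{(y_1^*, x_1^*), \ldots, (y_m^*, x_m^*)\}$ and $Z_2^* =
\{(y_{m+1}^*, x_{m+1}^*), \ldots, (y_n^*, x_n^*)\}$, we define

$$\mathrm{CV}(Z_1^*, Z_2^*; \lambda) = n^{-1}\Big(\sum_{i=1}^{m} \big(y_i^* - x_i^{*T}\hat\beta_{2\lambda}\big)^2 + \sum_{i=m+1}^{n} \big(y_i^* - x_i^{*T}\hat\beta_{1\lambda}\big)^2\Big), \qquad (10)$$

where $\hat\beta_{1\lambda}$ and $\hat\beta_{2\lambda}$ are obtained based on $Z_1^*$ and $Z_2^*$, respectively. The details of the extended
selection criterion proceed as follows.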

Algorithm 2 (extended selection criterion):

Steps 1-2. The same as those in Algorithm 1.

Step 3. Calculate $\mathrm{CV}(Z_1^{*b}, Z_2^{*b}; \lambda)$ as in (10).

Step 4. Repeat Steps 1-3 $B$ times and obtain the following ratio,

$$\mathrm{es}(\lambda) = \frac{B^{-1}\sum_{b=1}^{B} \hat{s}^{*b}(\Psi, \lambda, m)}{B^{-1}\sum_{b=1}^{B} \mathrm{CV}(Z_1^{*b}, Z_2^{*b}; \lambda)}. \qquad (11)$$

Step 5. Compute $\mathrm{es}(\lambda)$ for a sequence of $\lambda$'s and select $\hat\lambda = \mathrm{argmax}_\lambda\, \mathrm{es}(\lambda)$.

The criterion (11) does not require the thresholding value $\alpha_n$ since it becomes small when $\lambda$
deviates from the true value. Specifically, a small $\lambda$ leads to small variable selection stability as
discussed in Section 3, whereas a large $\lambda$ over-penalizes the model and may exclude some truly
informative variables, and thus leads to large cross-validation error. To demonstrate the
effectiveness of the extended selection criterion, we repeat the simulated example Scenario
I for $n = 40$ on the lasso regression. The percentage of selecting the true active set, the
averaged number of correctly selected zeros (C) and incorrectly selected zeros (I), and the
averaged RPE are summarized in Table 6. Figure 4 reports the results of detection and
sparsity for various $\lambda$'s as well as the extended selection criterion in (11) on the same sample.

Table 6 and Figure 4 about here

As expected, the extended selection criterion is more conservative in variable selection
than the kappa selection criterion because of the influence of cross-validation. It performs
slightly worse than the kappa selection criterion, but much better than the other criteria.
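
The final two steps of the extended criterion can be sketched as below, assuming that the per-split stabilities and cross-validation errors have already been computed for every candidate $\lambda$ and that the criterion is the ratio of the averaged stability to the averaged cross-validation error. The function name `extended_select` and the nested-list layout are hypothetical.

```python
def extended_select(lambdas, stability, cv_error):
    """Pick lambda maximizing averaged stability / averaged CV error.

    stability[b][t] and cv_error[b][t] hold the values for split b and
    candidate lambdas[t]; cv_error entries are assumed to be positive.
    """
    B = len(stability)
    scores = []
    for t in range(len(lambdas)):
        mean_s = sum(stability[b][t] for b in range(B)) / B
        mean_cv = sum(cv_error[b][t] for b in range(B)) / B
        scores.append(mean_s / mean_cv)
    best = max(range(len(lambdas)), key=scores.__getitem__)
    return lambdas[best], scores
```

With stabilities that rise in $\lambda$ and cross-validation errors that jump for a heavily penalized model, the ratio peaks at an intermediate $\lambda$, which is the behavior described for the extended criterion.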

8 Discussion 

This article proposes a novel tuning parameter selection criterion based on the concept of 
variable selection stability. Its key idea is to select the tuning parameter so that the resultant 
variable selection method is stable in selecting the informative variables. The proposed 
criterion delivers superior numerical performance in a variety of simulated examples and 
real applications. Its asymptotic selection consistency is also established for both fixed and 
diverging dimensions. Furthermore, it is worth pointing out that the idea of stability is 
general and can be naturally extended to a broader framework of model selection, such as 
the penalized nonparametric regression (Xue et al., 2010) and the penalized clustering (Sun 
et al., 2012). 

Appendix: technical proofs 

Proof of Lemma 1. We prove Lemma 1 for (1) the lasso regression, (2) the adaptive lasso,
and (3) the SCAD, respectively.

(1): The lasso regression. The proof follows immediately from existing results in the
literature. When $n^{1/2}\lambda_n \to \infty$ and $\lambda_n \to 0$, Assumption 1 is satisfied by the lasso regression
under the irrepresentable condition following Zhao and Yu (2006) and Yuan and Lin (2006),
and Assumption 2 is satisfied by the lasso regression following Zou (2006) and Bach (2008).

(2) : The adaptive lasso. First, Zou (2006) showed that the adaptive lasso is selection 
consistent when n\ n — > oo and y/n\ n — > 0, so Assumption 1 is satisfied. 
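To verify Assumption 2, we denote $\beta^*$ as the true coefficient, $\beta = \beta^* + u/\sqrt{n}$ and 

$$\Psi_n(u) = \Big\| y - \sum_{j=1}^{p} x_{(j)}\Big(\beta_j^* + \frac{u_j}{\sqrt{n}}\Big) \Big\|^2 + n\lambda_n \sum_{j=1}^{p} \frac{|\beta_j^* + u_j/\sqrt{n}|}{|\tilde\beta_j|},$$

where $\tilde\beta$ is the estimator from the lasso regression. Let $\hat u_n = \arg\min \Psi_n(u)$, $\hat\beta_n = \beta^* + \hat u_n/\sqrt{n}$, 
and $V_n(u) = \Psi_n(u) - \Psi_n(0)$ with 

$$V_n(u) = u^T\Big(\frac{X^TX}{n}\Big)u - 2\frac{\epsilon^TX}{\sqrt{n}}u + \sqrt{n}\lambda_n \sum_{j=1}^{p} \frac{\sqrt{n}\big(|\beta_j^* + u_j/\sqrt{n}| - |\beta_j^*|\big)}{|\tilde\beta_j|}.$$

Note that $X^TX/n \to C$, $\epsilon^TX/\sqrt{n} \to_d W^T$ with $W \sim N(0, \sigma^2 C)$, and $n\lambda_n \to a$ with $0 \le a < \infty$ implies 
$\sqrt{n}\lambda_n \to 0$. Following similar treatment as in Zou (2006), $\sqrt{n}(|\beta_j^* + u_j/\sqrt{n}| - |\beta_j^*|)/|\tilde\beta_j| \to_p u_j\,\mathrm{sgn}(\beta_j^*)/|\beta_j^*|$ 
when $\beta_j^* \ne 0$, and $\sqrt{n}(|\beta_j^* + u_j/\sqrt{n}| - |\beta_j^*|) = |u_j|$ when $\beta_j^* = 0$. 

If $a = 0$, the asymptotic normality of $\tilde\beta$ implies that $\sqrt{n}\lambda_n/|\tilde\beta_j| = n\lambda_n/|\sqrt{n}\tilde\beta_j| \to_p 0$ when $\beta_j^* = 0$, and then 
it follows from Slutsky's theorem that 

$$V_n(u) \to_d u^TCu - 2W^Tu.$$

Therefore, $\hat u_n \to_d C^{-1}W$, which implies that $P(j \in \hat{\mathcal{A}}_{\lambda_n}) \to 1$ for all $j \in \{1, \dots, p\}$, and thus 
Assumption 2 is satisfied. 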

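If $0 < a < \infty$, the asymptotic normality of $\hat\beta_n$ still holds, which implies that $P(\mathcal{A}_T \subseteq \hat{\mathcal{A}}_{\lambda_n}) \to 1$. 
It then suffices to consider the event $j \notin \hat{\mathcal{A}}_{\lambda_n}$ for any $j \in \mathcal{A}_T^c$. Note that when 
$j \notin \hat{\mathcal{A}}_{\lambda_n}$, the Karush-Kuhn-Tucker (KKT) conditions imply that 

$$\big|2x_{(j)}^T(y - X\hat\beta_n)\big| \le \frac{n\lambda_n}{|\tilde\beta_j|}.$$

In addition, 

$$\frac{2x_{(j)}^T(y - X\hat\beta_n)}{\sqrt{n}} = \frac{2x_{(j)}^TX\sqrt{n}(\beta^* - \hat\beta_n)}{n} + \frac{2x_{(j)}^T\epsilon}{\sqrt{n}}.$$

By the asymptotic normality of $\hat\beta_n$ and $X^TX/n \to C$, Slutsky's theorem implies that 
$2x_{(j)}^TX\sqrt{n}(\beta^* - \hat\beta_n)/n \to_d N(0, \Lambda_1)$ for some $\Lambda_1$, and $2x_{(j)}^T\epsilon/\sqrt{n} \to_d N(0, 4\|x_{(j)}\|^2\sigma^2)$. 
Therefore, as $n\lambda_n \to a$ with $0 < a < \infty$, 

$$P(j \notin \hat{\mathcal{A}}_{\lambda_n}) \le P\Big(\big|2x_{(j)}^T(y - X\hat\beta_n)\big| \le \frac{n\lambda_n}{|\tilde\beta_j|}\Big) = P\Big(\Big|\frac{2x_{(j)}^TX\sqrt{n}(\beta^* - \hat\beta_n)}{n} + \frac{2x_{(j)}^T\epsilon}{\sqrt{n}}\Big|\,\big|\sqrt{n}\tilde\beta_j\big| \le n\lambda_n\Big) \le 1 - c_1,$$

for some constant $c_1$. Therefore, Assumption 2 is satisfied with $c_0 < c_1$. 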

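(3): The SCAD. First, Fan and Li (2001) showed that the SCAD is selection consistent 
when $\sqrt{n}\lambda_n \to \infty$ and $\lambda_n \to 0$, so Assumption 1 is satisfied. 

Next, we show that the SCAD will be overfitted when $\sqrt{n}\lambda_n \to a$ with $0 \le a < \infty$. By 
Theorem 1 of Fan and Li (2001), $\hat\beta_n$ is a $\sqrt{n}$-consistent estimate of $\beta^*$ when $\lambda_n \to 0$, and 
hence $P(\mathcal{A}_T \subseteq \hat{\mathcal{A}}_{\lambda_n}) \to 1$. It then suffices to consider the event $j \notin \hat{\mathcal{A}}_{\lambda_n}$ for any $j \in \mathcal{A}_T^c$. 
In fact, the SCAD minimizes 

$$Q(\beta) = \Big\| y - \sum_{j=1}^{p} x_{(j)}\beta_j \Big\|^2 + n\sum_{j=1}^{p} p_{\lambda_n}(|\beta_j|), \qquad (12)$$

where the penalty term satisfies $p'_\lambda(\theta) = \lambda\big(I(\theta \le \lambda) + \frac{(\gamma\lambda - \theta)_+}{(\gamma - 1)\lambda} I(\theta > \lambda)\big)$ for some $\gamma > 2$ and 
$\theta > 0$. For any $\beta \in \{\beta : \|\sqrt{n}(\beta - \hat\beta_n)\| \le c_2\}$, 

$$\frac{\partial Q(\beta)}{\partial \beta_j} = -2x_{(j)}^T(y - X\beta) + n p'_{\lambda_n}(|\beta_j|)\,\mathrm{sgn}(\beta_j) = -n\lambda_n\Big(\frac{2x_{(j)}^TX\sqrt{n}(\beta^* - \beta)/n + 2x_{(j)}^T\epsilon/\sqrt{n}}{\sqrt{n}\lambda_n} - \frac{p'_{\lambda_n}(|\beta_j|)\,\mathrm{sgn}(\beta_j)}{\lambda_n}\Big),$$

where $X^TX/n \to C$, $\|\sqrt{n}(\beta^* - \beta)\| \le \|\sqrt{n}(\beta^* - \hat\beta_n)\| + \|\sqrt{n}(\hat\beta_n - \beta)\|$ is bounded in probability, 
and $2x_{(j)}^T\epsilon/\sqrt{n} \to_d N(0, 4\|x_{(j)}\|^2\sigma^2)$. In addition, with $\theta = |\beta_j|$, $p'_{\lambda_n}(\theta)/\lambda_n = I(\theta \le \lambda_n) + \frac{(\gamma\lambda_n - \theta)_+}{(\gamma - 1)\lambda_n} I(\theta > \lambda_n) \le 1$. 
Therefore, as $\sqrt{n}\lambda_n \to a$ with $0 \le a < \infty$, 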



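$$P\Big(\Big|\frac{2x_{(j)}^TX\sqrt{n}(\beta^* - \beta)/n + 2x_{(j)}^T\epsilon/\sqrt{n}}{\sqrt{n}\lambda_n}\Big| > \Big|\frac{p'_{\lambda_n}(|\beta_j|)\,\mathrm{sgn}(\beta_j)}{\lambda_n}\Big|\Big) \ge P\Big(\Big|\frac{2x_{(j)}^TX\sqrt{n}(\beta^* - \beta)}{n} + \frac{2x_{(j)}^T\epsilon}{\sqrt{n}}\Big| > \sqrt{n}\lambda_n\Big) \to \begin{cases} c_2, & \text{if } a > 0, \\ 1, & \text{if } a = 0, \end{cases}$$

for some constant $c_2 > 0$. Therefore, if $a > 0$, there exists a constant $c_0 > 0$ such that with 
a positive probability $c_0$, 

$$\frac{\partial Q(\beta)}{\partial \beta_j} < 0 \quad \text{when } 0 < \beta_j < Mn^{-1/2}; \qquad (13)$$

$$\frac{\partial Q(\beta)}{\partial \beta_j} > 0 \quad \text{when } -Mn^{-1/2} < \beta_j < 0, \qquad (14)$$

with $M$ sufficiently large such that $P\big(\sup_{\|u\| = M} Q(\beta^* + (n^{-1/2} + a_n)u) > Q(\beta^*)\big) \to 1$ and 
$a_n = \max\{p'_{\lambda_n}(|\beta_j^*|) : \beta_j^* \ne 0\}$, which implies that $P(\hat\beta_j \ne 0) \ge c_0$ for sufficiently large $n$. If 
$a = 0$, with probability tending to 1, 

$$\frac{\partial Q(\beta)}{\partial \beta_j} < 0 \quad \text{when } -Mn^{-1/2} < \beta_j < Mn^{-1/2}, \qquad (15)$$

and hence $P(\hat\beta_j \ne 0) \to 1$. Therefore, Assumption 2 is satisfied by the SCAD with $r_n = n^{-1/2}$ 
and $s_n = o(1)$. This ends the proof of Lemma 1. ■ 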
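
As a small numerical companion to part (3), the SCAD penalty derivative can be evaluated directly. The following is a minimal Python sketch rather than the authors' code: the function name `scad_derivative` is hypothetical, and the default `gamma = 3.7` is an assumption borrowed from the value suggested by Fan and Li (2001). It illustrates that the ratio $p'_\lambda(\theta)/\lambda$ stays between 0 and 1, which is the bound used in the proof.

```python
def scad_derivative(theta, lam, gamma=3.7):
    """Derivative p'_lam(theta) of the SCAD penalty for theta > 0.

    Implements lam * { I(theta <= lam)
                       + (gamma*lam - theta)_+ / ((gamma - 1)*lam) * I(theta > lam) }.
    The default gamma = 3.7 is an assumption, not taken from this paper.
    """
    if theta <= 0 or lam <= 0 or gamma <= 2:
        raise ValueError("requires theta > 0, lam > 0 and gamma > 2")
    if theta <= lam:
        return lam
    # For theta > lam the factor lam cancels with the denominator (gamma - 1) * lam.
    return max(gamma * lam - theta, 0.0) / (gamma - 1.0)


# Ratio p'_lam(theta) / lam on a small, purely illustrative grid of theta values.
lam = 0.5
ratios = [scad_derivative(t, lam) / lam for t in (0.1, 0.5, 1.0, 1.85, 3.0)]
```

The ratio equals 1 for small coefficients, decays linearly between $\lambda$ and $\gamma\lambda$, and vanishes beyond $\gamma\lambda$.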
Additional Notations: Note that any variable selection method is trivially stable if it 
always selects the complete variable set or the empty variable set; however, it violates the 
assumption that the true active set is neither the complete set nor the empty set. In 
Algorithm 1, the variable selection stabilities of such trivial methods are set as $-1$ and thus will 
never be selected. Therefore, it suffices to focus on the set of $\lambda$'s that lead to non-degenerate 
variable selection methods. Specifically, for some fixed constant $\delta > 0$, define 

$$\Lambda_n = \big\{\lambda : P(\hat{\mathcal{A}}_\lambda \ne \emptyset) \ge \delta \text{ and } P(\hat{\mathcal{A}}_\lambda \ne \{1, \dots, p\}) \ge \delta\big\},$$

and a set of $\lambda$'s that lead to non-degenerate stable variable selection methods as 

$$\tilde\Lambda_n = \big\{\lambda \in \Lambda_n : P(\hat s(\Psi, \lambda, m) \ge 1 - \eta_n) \ge 1 - \xi_n \text{ for some } \eta_n \to 0 \text{ and } \xi_n \to 0\big\}, \qquad (16)$$

where $m = \lfloor n/2 \rfloor$ and the probability $P$ is taken with respect to the training sample. 
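
To make the quantity $\hat s(\Psi, \lambda, m)$ concrete, the sketch below averages Cohen's kappa (Cohen, 1960) between the active sets obtained from $B$ pairs of sub-samples. It is an illustration under stated assumptions rather than the authors' implementation: the names `kappa` and `estimated_stability` are hypothetical, the active sets are passed in as Python sets of variable indices, and the kappa formula is the standard two-rater form. The value $-1$ for degenerate selections follows the convention described above.

```python
def kappa(a1, a2, p):
    """Cohen's kappa between two active sets a1, a2 of indices from {0, ..., p-1}."""
    a1, a2 = set(a1), set(a2)
    if (not a1 and not a2) or (len(a1) == p and len(a2) == p):
        return -1.0  # both empty or both complete: degenerate, never preferred
    n11 = len(a1 & a2)            # selected by both
    n12 = len(a1 - a2)            # selected by the first only
    n21 = len(a2 - a1)            # selected by the second only
    n22 = p - n11 - n12 - n21     # selected by neither
    pr_a = (n11 + n22) / p        # observed agreement
    pr_e = ((n11 + n12) * (n11 + n21) + (n12 + n22) * (n21 + n22)) / p**2
    return (pr_a - pr_e) / (1.0 - pr_e)


def estimated_stability(pairs, p):
    """Average kappa over B pairs of active sets, one pair per random split."""
    return sum(kappa(a1, a2, p) for a1, a2 in pairs) / len(pairs)


# Three hypothetical splits with p = 8 candidate variables.
pairs = [({0, 1}, {0, 1}), ({0, 1}, {0, 2}), ({0, 1}, {0, 1, 3})]
s_hat = estimated_stability(pairs, p=8)
```

In practice each pair would come from fitting the penalized regression on two halves of a random split of the training sample; here the pairs are supplied by hand.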

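Lemma 2 For $\lambda_n$ defined as in Assumption 1, the resultant variable selection method is 
selection consistent in that $P(\hat{\mathcal{A}}_{\lambda_n} = \mathcal{A}_T) \ge 1 - \epsilon_n$ for some $\epsilon_n \to 0$; then for any $\eta_n \succ \epsilon_n$, 

$$P\big(\hat s(\Psi, \lambda_n, m) \ge 1 - \eta_n\big) \ge 1 - 2\epsilon_n/\eta_n,$$

and hence $\lambda_n \in \tilde\Lambda_n$. 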

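Proof of Lemma 2: For clarity, we denote $\lambda_n$ satisfying Assumption 1 as $\lambda_n^*$, and then 
the selection consistency implies that $P(\hat{\mathcal{A}}_{\lambda_n^*} = \mathcal{A}_T) \ge 1 - \epsilon_n$ for some $\epsilon_n \to 0$. We further 
denote $\hat{\mathcal{A}}^{*b}_{1\lambda_n^*}$ and $\hat{\mathcal{A}}^{*b}_{2\lambda_n^*}$ as the corresponding active sets obtained from two sub-samples at 
the $b$-th random splitting. Then the estimated variable selection stability based on the $b$-th 
splitting can be bounded as 

$$P\big(\hat s^{*b}(\Psi, \lambda_n^*, m) = 1\big) = P\big(\hat{\mathcal{A}}^{*b}_{1\lambda_n^*} = \hat{\mathcal{A}}^{*b}_{2\lambda_n^*}\big) \ge P\big(\hat{\mathcal{A}}^{*b}_{1\lambda_n^*} = \mathcal{A}_T\big)^2 \ge (1 - \epsilon_n)^2 \ge 1 - 2\epsilon_n.$$

By the fact that $0 \le \hat s^{*b}(\Psi, \lambda_n^*, m) \le 1$, 

$$E\big(\hat s(\Psi, \lambda_n^*, m)\big) = E\Big(B^{-1}\sum_{b=1}^{B} \hat s^{*b}(\Psi, \lambda_n^*, m)\Big) = E\big(\hat s^{*b}(\Psi, \lambda_n^*, m)\big) \ge 1 - 2\epsilon_n.$$

In addition, since $0 \le \hat s(\Psi, \lambda_n^*, m) \le 1$, the Markov inequality yields that 

$$P\big(1 - \hat s(\Psi, \lambda_n^*, m) \ge \eta_n\big) \le \frac{E\big(1 - \hat s(\Psi, \lambda_n^*, m)\big)}{\eta_n} \le \frac{2\epsilon_n}{\eta_n},$$

which implies the desired result immediately. ■ 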
Lemma 2 shows that if a variable selection method is selection consistent, its variable 
selection stability converges to 1 in probability. It also assures that there always exists $\lambda_n$ 
such that the resultant variable selection method is stable and non-degenerate. 
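
The bound in Lemma 2 can be checked numerically in a stylised setting. The sketch below is an illustration only. It assumes, beyond anything stated in the proof, that each of the $B$ splits independently yields stability 1 with probability $(1 - \epsilon_n)^2$ and stability 0 otherwise, which is a pessimistic simplification; the function names are hypothetical. Under that assumption the probability that the averaged stability is at least $1 - \eta_n$ is a binomial tail, which can be compared with the Markov bound $1 - 2\epsilon_n/\eta_n$.

```python
from math import comb


def prob_stability_at_least(eps, eta, B):
    """P(average of B Bernoulli stabilities >= 1 - eta) under the toy model.

    Each split scores 1 with probability q = (1 - eps)**2 and 0 otherwise,
    so the average is at least 1 - eta exactly when at most eta * B splits score 0.
    """
    q = (1.0 - eps) ** 2
    max_zeros = int(eta * B + 1e-12)  # largest admissible number of zero scores
    return sum(comb(B, k) * (1.0 - q) ** k * q ** (B - k)
               for k in range(max_zeros + 1))


def markov_lower_bound(eps, eta):
    """Lower bound 1 - 2 * eps / eta from Lemma 2."""
    return 1.0 - 2.0 * eps / eta


# Hypothetical values chosen only for illustration.
exact = prob_stability_at_least(eps=0.01, eta=0.1, B=20)
bound = markov_lower_bound(eps=0.01, eta=0.1)
```

The exact tail probability sits above the Markov bound, as it must; the bound is loose but suffices for the asymptotic argument because only $\epsilon_n/\eta_n \to 0$ is needed.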
Proof of Theorem 1: Let $r_n \prec \lambda_n^* \prec s_n$. Assumption 1 implies that $P(\hat{\mathcal{A}}_{\lambda_n^*} = \mathcal{A}_T) \ge 1 - \epsilon_n$ 
for some $\epsilon_n \to 0$, and Lemma 2 implies that $\lambda_n^* \in \tilde\Lambda_n$. Denote $\tilde\lambda_n = \min_{\prec}\{\lambda : \lambda \in \tilde\Lambda_n\}$ with 
$\min_{\prec}$ representing minimization up to a constant, and hence $\tilde\lambda_n \preceq \lambda_n^*$. Then we prove 
Theorem 1 in two steps. Step 1 shows that the variable selection method with $\tilde\lambda_n$ is selection 
consistent, and step 2 assures that $P(\tilde\lambda_n \asymp \hat\lambda_n) \to 1$ with $\hat\lambda_n$ being defined as in Algorithm 
1. The desired result follows immediately after these two steps. 

Step 1 is proved by contradiction. If the variable selection method with $\tilde\lambda_n$ is not selection 
consistent, then by Assumption 1 we have $\tilde\lambda_n \not\succ r_n$ or $\tilde\lambda_n \not\prec s_n$. Without loss of generality, we 
assume that the limits of $r_n^{-1}\tilde\lambda_n$ and $s_n^{-1}\tilde\lambda_n$ exist (where the limit of $s_n^{-1}\tilde\lambda_n$ can be infinity), 
since otherwise we can focus on the corresponding convergent subsequences $r_{n_m}^{-1}\tilde\lambda_{n_m}$ and 
$s_{n_m}^{-1}\tilde\lambda_{n_m}$. Then $\tilde\lambda_n \not\succ r_n$ implies that (1) $r_n^{-1}\tilde\lambda_n \to a \ge 0$, and $\tilde\lambda_n \not\prec s_n$ implies that (2) 
$s_n^{-1}\tilde\lambda_n \to b > 0$, where $b$ can be infinity. We now show that both (1) and (2) will lead to 
contradictions. 

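If case (2) occurs, $\tilde\lambda_n \succeq s_n \succ \lambda_n^*$, which contradicts the fact that $\tilde\lambda_n \preceq \lambda_n^*$. 

If case (1) occurs, by Assumption 2, there exists a constant $c_0 > 0$ such that for any 
$j \in \mathcal{A}_T^c$, $P(\mathcal{A}_T \cup \{j\} \subseteq \hat{\mathcal{A}}_{\tilde\lambda_n}) \ge c_0$ for sufficiently large $n$. In addition, there also exists 
$j_1 \in \mathcal{A}_T^c$ such that $P(j_1 \notin \hat{\mathcal{A}}_{\tilde\lambda_n}) \ge c_3 > 0$ when $n$ is sufficiently large, since otherwise 
$P(\hat{\mathcal{A}}_{\tilde\lambda_n} = \{1, \dots, p\}) \to 1$, which contradicts the fact that $\tilde\lambda_n \in \tilde\Lambda_n$. Therefore, 

$$P\big(\hat{\mathcal{A}}^{*b}_{1\tilde\lambda_n} \ne \hat{\mathcal{A}}^{*b}_{2\tilde\lambda_n}\big) \ge P\big(j_1 \in \hat{\mathcal{A}}^{*b}_{1\tilde\lambda_n},\ j_1 \notin \hat{\mathcal{A}}^{*b}_{2\tilde\lambda_n}\big) \ge c_0 c_3$$

for sufficiently large $n$, where the last inequality follows from the fact that the two sub-samples 
are independent. 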

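Since $\hat{\mathcal{A}}^{*b}_{1\tilde\lambda_n} \ne \hat{\mathcal{A}}^{*b}_{2\tilde\lambda_n}$ implies that $\hat s^{*b}(\Psi, \tilde\lambda_n, m) \le c_4$ with $c_4 = \max_{A_1 \ne A_2} \kappa(A_1, A_2) \le \frac{p-1}{p}$, 
where $A_1, A_2 \subseteq \{1, \dots, p\}$, we have for sufficiently large $n$, 

$$P\big(\hat s^{*b}(\Psi, \tilde\lambda_n, m) \le c_4\big) \ge c_0 c_3.$$

Therefore, for any $B > 0$ and sufficiently large $n$, 

$$E\big(\hat s(\Psi, \tilde\lambda_n, m)\big) = E\Big(B^{-1}\sum_{b=1}^{B} \hat s^{*b}(\Psi, \tilde\lambda_n, m)\Big) = E\big(\hat s^{*b}(\Psi, \tilde\lambda_n, m)\big) \le 1 - c_0 c_3 (1 - c_4),$$

which is a constant strictly less than 1. By the Markov inequality, for any $\eta_n \to 0$, 

$$P\big(\hat s(\Psi, \tilde\lambda_n, m) \ge 1 - \eta_n\big) \le \frac{E\big(\hat s(\Psi, \tilde\lambda_n, m)\big)}{1 - \eta_n} \le \frac{1 - c_0 c_3 (1 - c_4)}{1 - \eta_n} \to 1 - c_0 c_3 (1 - c_4). \qquad (17)$$

This contradicts the fact that $P(\hat s(\Psi, \tilde\lambda_n, m) \ge 1 - \eta_n) \ge 1 - \xi_n$ for some $\xi_n \to 0$. This 
ends the proof of step 1. 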

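Next we show that $P(\tilde\lambda_n \asymp \hat\lambda_n) \to 1$. On one hand, setting $\alpha_n \to 0$ and $\alpha_n \succ \epsilon_n$ in 
Algorithm 1 yields that 

$$\hat s(\Psi, \hat\lambda_n, m) \ge (1 - \alpha_n) \max_\lambda \hat s(\Psi, \lambda, m) \ge (1 - \alpha_n)\,\hat s(\Psi, \lambda_n^*, m).$$

Then by Lemma 2 and the fact that $\alpha_n \succ \epsilon_n$, $P\big(\hat s(\Psi, \lambda_n^*, m) \ge 1 - \alpha_n\big) \ge 1 - 2\epsilon_n/\alpha_n$. Therefore, 

$$P\big(\hat s(\Psi, \hat\lambda_n, m) \ge (1 - \alpha_n)(1 - \alpha_n)\big) \ge 1 - \frac{2\epsilon_n}{\alpha_n}.$$

Since $(1 - \alpha_n)^2 \ge 1 - 2\alpha_n$, $\alpha_n \to 0$ and $\epsilon_n/\alpha_n \to 0$, we have $P(\hat\lambda_n \in \tilde\Lambda_n) \to 1$, which implies 
that $P(\hat\lambda_n \succeq \tilde\lambda_n) \to 1$. 

On the other hand, since $\tilde\lambda_n = \min_{\prec}\{\lambda : \lambda \in \tilde\Lambda_n\}$ and $\alpha_n \to 0$, 

$$P\Big(\frac{\hat s(\Psi, \tilde\lambda_n, m)}{\max_\lambda \hat s(\Psi, \lambda, m)} \ge 1 - \alpha_n\Big) = P\Big(\hat s(\Psi, \tilde\lambda_n, m) \ge (1 - \alpha_n)\max_\lambda \hat s(\Psi, \lambda, m)\Big) \to 1,$$

and hence $\tilde\lambda_n$ belongs to the set $\{\lambda : \hat s(\Psi, \lambda, m)/\max_{\lambda'} \hat s(\Psi, \lambda', m) \ge 1 - \alpha_n\}$ with probability tending to 1, which implies that $P(\tilde\lambda_n \succeq \hat\lambda_n) \to 1$. Therefore, step 2 is 
proved, and Theorem 1 follows immediately after steps 1 and 2. ■ 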
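
Step 2 relies on the thresholding rule of Algorithm 1, which, as read from the inequalities above, keeps the smallest tuning parameter whose estimated stability is within a factor $1 - \alpha_n$ of the maximum. A minimal sketch of that rule follows. It assumes the stabilities have already been computed on a finite grid; the function name `select_lambda` and the toy numbers are hypothetical and not taken from the paper.

```python
def select_lambda(lambdas, stabilities, alpha):
    """Smallest lambda whose stability is at least (1 - alpha) times the maximum."""
    if len(lambdas) != len(stabilities) or not lambdas:
        raise ValueError("lambdas and stabilities must be non-empty and aligned")
    best = max(stabilities)
    if best <= 0:
        raise ValueError("no lambda yields a positive stability")
    eligible = [lam for lam, s in zip(lambdas, stabilities)
                if s >= (1.0 - alpha) * best]
    return min(eligible)


# Toy grid: the last entry mimics a degenerate fit whose stability is set to -1.
lambdas = [0.01, 0.05, 0.10, 0.50, 1.00]
stabilities = [0.40, 0.93, 0.97, 0.95, -1.0]
chosen = select_lambda(lambdas, stabilities, alpha=0.05)
```

With `alpha = 0.05` the rule accepts every grid point whose stability is at least 0.95 times the maximum and returns the smallest of them; with `alpha = 0` it reduces to picking the maximizer itself.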
Proof of Theorem 2: In the diverging $p_n$ case, we denote the set of $\lambda$'s that lead to 
non-degenerate stable variable selection methods as 

$$\tilde\Lambda_{p_n} = \big\{\lambda \in \Lambda_n : P(\hat s(\Psi, \lambda, m) \ge 1 - \eta_n) \ge 1 - \xi_n \text{ for some } \eta_n \to 0 \text{ and } \epsilon_n \prec \xi_n \prec p_n^{-1}c_0(p_n)\big\},$$

where $\tilde\Lambda_{p_n}$ depends on the dimension $p_n$. We further denote $\tilde\lambda_{p_n} = \min_{\prec}\{\lambda : \lambda \in \tilde\Lambda_{p_n}\}$ with 
$\min_{\prec}$ representing minimization up to a constant. 

First, since $\epsilon_n \prec p_n^{-1}c_0(p_n)$, there always exists $\eta_n \to 0$ such that $\epsilon_n \prec 
\epsilon_n/\eta_n \prec p_n^{-1}c_0(p_n)$, and thus $\lambda_n^* \in \tilde\Lambda_{p_n}$ by Lemma 2. Next, we prove Theorem 2 in the same 
two steps as in the proof of Theorem 1. Step 1 shows that the variable selection method 
with $\tilde\lambda_{p_n}$ is selection consistent, and step 2 assures that $P(\tilde\lambda_{p_n} \asymp \hat\lambda_n) \to 1$ with $\hat\lambda_n$ being 
defined as in Algorithm 1. 

Both steps can be shown similarly as in the proof of Theorem 1 after some slight 
modification. In fact, Step 1 can be shown by deriving similar contradictions, except that in (17), 
since $c_4 \le \frac{p_n - 1}{p_n}$, 

$$P\big(\hat s(\Psi, \tilde\lambda_{p_n}, m) \ge 1 - \eta_n\big) \le \frac{1 - c_0(p_n) c_3 (1 - c_4)}{1 - \eta_n} \le \frac{1 - p_n^{-1} c_0(p_n) c_3}{1 - \eta_n},$$

which still leads to contradiction with the fact that $\tilde\lambda_{p_n} \in \tilde\Lambda_{p_n}$. Step 2 can be shown similarly 
by setting $\alpha_n \to 0$ and $\epsilon_n/\alpha_n \prec p_n^{-1}c_0(p_n)$ in Algorithm 1, which yields that $P\big(\hat s(\Psi, \hat\lambda_n, m) \ge 
(1 - \alpha_n)(1 - \alpha_n)\big) \ge 1 - 2\epsilon_n/\alpha_n$, and 

$$P\Big(\frac{\hat s(\Psi, \tilde\lambda_{p_n}, m)}{\max_\lambda \hat s(\Psi, \lambda, m)} \ge 1 - \alpha_n\Big) \to 1.$$

This ends the proof of Theorem 2. ■ 



References 

[1] Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE 
Transactions on Automatic Control, 19, 716-723. 

[2] Bach, F. R. (2008). Bolasso: Model Consistent Lasso Estimation Through the 
Bootstrap. In Proceedings of the International Conference on Machine Learning (ICML). 

[3] Ben-Hur, A., Elisseeff, A. and Guyon, I. (2002). A Stability Based Method for 
Discovering Structure in Clustered Data. Pacific Symposium on Biocomputing, 6-17. 

[4] Breheny, P. and Huang, J. (2011). Coordinate Descent Algorithms for Noncon- 
vex Penalized Regression, with Applications to Biological Feature Selection. Annals of 
Applied Statistics, 5, 232-253. 

[5] Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and 
Psychological Measurement, 20, 37-46. 






[6] Craven, P. and Wahba, G. (1979). Smoothing Noisy Data with Spline Functions: 
Estimating the Correct Degree of Smoothing by the Method of Generalized 
Cross-Validation. Numerische Mathematik, 31, 377-403. 

[7] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle 
Regression. Annals of Statistics, 32, 407-451. 

[8] Fan, J. and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and 
Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360. 

[9] Fan, J. and Li, R. (2006). Statistical Challenges with High Dimensionality: Fea- 
ture Selection in Knowledge Discovery. In Proceedings of the International Congress of 
Mathematicians, 3, 595-622. 

[10] Fan, J. and Peng, H. (2004). Nonconcave Penalized Likelihood with A Diverging 
Number of Parameters. Annals of Statistics, 32, 928-961. 

[11] Huang, J., Ma, S. and Zhang, C. H. (2008). Adaptive Lasso for Sparse High- 
dimensional Regression Models. Statistica Sinica, 18, 1603-1618. 

[12] Mallows, C. (1973). Some Comments on Cp. Technometrics, 15, 661-675. 

[13] Meinshausen, N. and Buehlmann, P. (2010). Stability Selection. Journal of the 
Royal Statistical Society, Series B, 72, 417-473. 

[14] Nishii, R. (1984). Asymptotic Properties of Criteria for Selection of Variables in Mul- 
tiple Regression. Annals of Statistics, 12, 758-765. 

[15] Schwarz, G. E. (1978). Estimating the Dimension of a Model. Annals of Statistics, 
6, 461-464. 






[16] Shen, X., Pan, W., Zhu, Y. and Zhou, H. (2012). On L0 Regularization in High- 
dimensional Regression. Journal of the American Statistical Association, to appear. 

[17] Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., 
Redwine, E.A. and Yang, N. (1989). Prostate Specific Antigen in the Diagnosis 
and Treatment of Adenocarcinoma of the Prostate: II. Radical Prostatectomy Treated 
Patients. Journal of Urology, 141, 1076-1083. 

[18] Stone, M. (1974). Cross-validatory Choice and Assessment of Statistical Predictions. 
Journal of the Royal Statistical Society, Series B, 36, 111-147. 

[19] Sun, W., Wang, J. and Fang, Y. (2012). Regularized K-means Clustering of 
High-dimensional Data and Its Asymptotic Consistency. Electronic Journal of Statistics, 6, 
148-167. 

[20] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of 
the Royal Statistical Society, Series B, 58, 267-288. 

[21] Wang, H. and Leng, C. (2008). A Note on Adaptive Group Lasso. Computational 
Statistics and Data Analysis, 52, 5277-5286. 

[22] Wang, H., Li, R. and Tsai, C. L. (2007). Tuning Parameter Selectors for the 
Smoothly Clipped Absolute Deviation Method. Biometrika, 94, 553-568. 

[23] Wang, H., Li, B. and Leng, C. (2009). Shrinkage Tuning Parameter Selection with 
A Diverging Number of Parameters. Journal of the Royal Statistical Society, Series B, 
71, 671-683. 

[24] Wang, J. (2010). Consistent Selection of the Number of Clusters via Cross Validation. 
Biometrika, 97, 893-904. 






[25] Xue, L., Qu, A. and Zhou, J. (2010). Consistent Model Selection for Marginal General- 
ized Additive Model for Correlated Data. Journal of the American Statistical Associa- 
tion, 105, 1518-1530. 

[26] Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with 
Grouped Variables. Journal of the Royal Statistical Society, Series B, 68, 49-67. 

[27] Zhang, Y., Li, R. and Tsai, C. L. (2010). Regularization Parameter Selections 
via Generalized Information Criterion. Journal of the American Statistical Association, 
105, 312-323. 

[28] Zhao, P. and Yu, B. (2006). On Model Selection Consistency of Lasso. Journal of 
Machine Learning Research, 7, 2541-2563. 

[29] Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American 
Statistical Association, 101, 1418-1429. 

[30] Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic 
Net. Journal of the Royal Statistical Society, Series B, 67, 301-320. 

[31] Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "Degree of Freedom" of 
the Lasso. Annals of Statistics, 35, 2173-2192. 

[32] Zou, H. and Zhang, H. (2009). On the Adaptive Elastic-net with A Diverging Num- 
ber of Parameters. Annals of Statistics, 37, 1733-1751. 






Table 1: The percentages of selecting the true active set for various selection criteria in 
simulation 1. 

n    Penalty      Ks     Cp     BIC    CV     GCV
40   Lasso        0.63   0.16   0.29   0.09   0.16
     Ada lasso    0.98   0.53   0.75   0.63   0.52
     SCAD         0.98   0.55   0.81   0.76   0.52
60   Lasso        0.81   0.16   0.35   0.14   0.17
     Ada lasso    0.99   0.52   0.87   0.65   0.52
     SCAD         1      0.58   0.88   0.76   0.56
80   Lasso        0.89   0.16   0.38   0.09   0.16
     Ada lasso    0.99   0.56   0.88   0.77   0.56
     SCAD         0.99   0.62   0.89   0.75   0.61



Table 2: The averaged numbers of correctly selected zeros (C) and incorrectly selected zeros 
(I) for various selection criteria in simulation 1. 

                  Ks            Cp            BIC           CV            GCV
n    Penalty      C      I      C      I      C      I      C      I      C      I
40   Lasso        4.58   0.01   3.26   0      3.68   0      2.66   0      3.25   0
     Ada lasso    4.98   0      4.16   0      4.59   0      4.25   0      4.15   0
     SCAD         4.99   0.01   4.11   0      4.63   0      4.39   0      4.06   0
60   Lasso        4.8    0      3.12   0      4      0      2.85   0      3.13   0
     Ada lasso    4.99   0      4.17   0      4.84   0      4.35   0      4.17   0
     SCAD         5      0      4.15   0      4.84   0      4.37   0      4.12   0
80   Lasso        4.88   0      3.01   0      4.05   0      2.66   0      3      0
     Ada lasso    4.99   0      4.19   0      4.84   0      4.49   0      4.19   0
     SCAD         4.99   0      4.23   0      4.83   0      4.45   0      4.22   0
Table 3: The percentages of selecting the true active set for various selection criteria in 
simulation 2. 

n     p_n   Penalty      Ks     Cp     BIC    CV     GCV
100   10    Lasso        0.89   0.11   0.22   0.09   0.11
            Ada lasso    1      0.58   0.89   0.70   0.58
            SCAD         1      0.58   0.89   0.80   0.57
200   14    Lasso        0.96   0.02   0.09   0      0.02
            Ada lasso    1      0.41   0.93   0.80   0.42
            SCAD         1      0.43   0.91   0.77   0.43
400   20    Lasso        1      0.04   0.07   0.01   0.04
            Ada lasso    1      0.3    0.87   0.72   0.29
            SCAD         1      0.37   0.88   0.72   0.37
800   28    Lasso        1      0      0.03   0      0
            Ada lasso    1      0.22   0.94   0.77   0.22
            SCAD         1      0.34   0.98   0.76   0.34



Table 4: The averaged numbers of correctly selected zeros (C) and incorrectly selected zeros 
(I) for various selection criteria in simulation 2. 

                         Ks           Cp            BIC           CV            GCV
n     p_n   Penalty      C      I     C       I     C       I     C       I     C       I
100   10    Lasso        4.88   0     2.80    0     3.37    0     2.60    0     2.80    0
            Ada lasso    5      0     4.34    0     4.84    0     4.45    0     4.34    0
            SCAD         5      0     4.32    0     4.84    0     4.64    0     4.30    0
200   14    Lasso        8.96   0     5.52    0     6.70    0     5.13    0     5.53    0
            Ada lasso    9      0     7.71    0     8.92    0     8.37    0     7.73    0
            SCAD         9      0     7.59    0     8.89    0     8.37    0     7.58    0
400   20    Lasso        15     0     9.52    0     11.78   0     9.24    0     9.52    0
            Ada lasso    15     0     12.48   0     14.83   0     14.10   0     12.47   0
            SCAD         15     0     12.60   0     14.81   0     14.06   0     12.59   0
800   28    Lasso        23     0     16.50   0     19.44   0     16.39   0     16.40   0
            Ada lasso    23     0     19.95   0     22.94   0     22.54   0     19.95   0
            SCAD         23     0     19.59   0     22.98   0     22.28   0     19.59   0
Table 5: The selected active sets and the relative prediction errors (RPE) for various 
selection criteria in the prostate cancer example. 

             Penalty      Ks        Cp                BIC         CV                GCV
Active Set   Lasso        1,2,4,5   1,2,3,4,5,6,7,8   1,2,4,5     1,2,3,4,5,7,8     1,2,3,4,5,6,7,8
             Ada lasso    1,2,5     1,2,3,4,5         1,2,3,4,5   1,2,3,4,5,6,7,8   1,2,3,4,5
             SCAD         1,2,4,5   1,2,3,4,5         1,2,3,4,5   1,2,3,4,5,6,7,8   1,2,3,4,5
RPE          Lasso        0.734     0.797             0.734       0.807             0.797
             Ada lasso    0.806     0.825             0.825       0.797             0.825
             SCAD         0.734     0.825             0.825       0.797             0.825



Table 6: The percentage of selecting the true active set, the averaged number of correctly 
selected zeros (C) and incorrectly selected zeros (I), and the relative prediction error (RPE) 
of Algorithm 2 (Extended) compared with that of Algorithm 1 (Ks). 

Algorithms   Percentage   C      I      RPE (s.d.)
Ks           0.63         4.58   0.01   0.088 (0.021)
Extended     0.45         4.16   0      0.100 (0.012)
Figure 1: Relative prediction errors (RPE) for various selection criteria in simulation 1, 
where 'K', 'Cp', 'B', 'C' and 'G' represent the kappa selection criterion, Mallows' $C_p$, BIC, 
CV and GCV, respectively. 

Figure 2: The detection and sparsity of the lasso regression with the kappa selection criterion 
in simulation 1 are shown on the top, and the sensitivity of $\alpha$ to the relative prediction error 
is shown on the bottom. 

Figure 3: Relative prediction errors (RPE) for various selection criteria in simulation 2, 
where 'K', 'Cp', 'B', 'C' and 'G' represent the kappa selection criterion, Mallows' $C_p$, BIC, 
CV and GCV, respectively. 

Figure 4: The detection and sparsity of the lasso regression with the extended selection 
criterion (denoted as Extended) in Algorithm 2. 